Large and heterogeneous datasets may contain thousands of records missing spatial or taxonomic information (partially or entirely), as well as records outside a region of interest or from doubtful sources. Such lower-quality data are not fit for use in many research applications without prior amendments. The ‘Pre-filter’ step contains a series of tests to detect, remove, and, whenever possible, correct such erroneous or suspect records.
Important:
The results of the VALIDATION tests used to flag data quality are appended to the database as separate fields and returned as TRUE or FALSE: TRUE indicates a correct record, and FALSE a potentially problematic or suspect record.
You can install the released version of ‘BDC’ from GitHub with:
if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")
Creating folders to save the results
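A minimal sketch of this folder setup in base R follows. It is an illustration only: the subfolder names are taken from the output paths printed later in this workflow (Output/Check, Output/Report, Output/Figures), and the temporary root directory is an assumption made so the example has no side effects on a real project.

```r
# Hypothetical root: a temporary directory stands in for the project folder.
root <- file.path(tempdir(), "Output")

# Subfolder names as they appear in the log paths of this workflow.
subdirs <- c("Check", "Report", "Figures")
paths <- file.path(root, subdirs)

# Create each folder (and the root) if it does not exist yet.
for (p in paths) {
  dir.create(p, recursive = TRUE, showWarnings = FALSE)
}

# TRUE for every folder that now exists.
created <- dir.exists(paths)
```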
Read the merged database created in the ‘Standardization and integration of different datasets’ step of the BDC workflow. It is also possible to read any dataset containing the fields required to run the workflow (more details here).
Standardization of character encoding
VALIDATION. This test flags records missing species names.
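The logic of this test can be sketched in base R as follows. This is not the package's own code: the example data are made up, and treating whitespace-only names as missing is an assumption of the sketch. The flag column name mirrors the one listed in the final output of this workflow.

```r
# Made-up records; the last three have no usable species name.
records <- data.frame(
  scientificName = c("Panthera onca", "", NA, "   "),
  stringsAsFactors = FALSE
)

name <- records$scientificName

# TRUE = name present; FALSE = NA, empty, or whitespace only (assumption).
records$.scientificName_empty <- !is.na(name) & nzchar(trimws(name))
```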
VALIDATION. This test flags records with partially or completely missing geographic coordinates.
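As a rough illustration (not the package's implementation), a record passes this kind of test only when both latitude and longitude are present. The data below are made up, and the sketch assumes the coordinates are already stored as numeric columns.

```r
# Made-up records: complete, missing latitude, missing longitude, missing both.
records <- data.frame(
  decimalLatitude  = c(-15.6, NA, 10.2, NA),
  decimalLongitude = c(-47.9, -60.0, NA, NA)
)

# TRUE = both coordinates present; FALSE = partially or completely missing.
records$.coordinates_empty <-
  !is.na(records$decimalLatitude) & !is.na(records$decimalLongitude)
```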
VALIDATION. This test flags records with out-of-range coordinates, that is, latitude > 90 or < -90, or longitude > 180 or < -180.
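The range rule can be written as a short base R check. The values are made up, and treating missing coordinates as failing the test is a choice made for this sketch rather than a statement about how the package behaves.

```r
# Made-up coordinates; only the first pair is within the valid range.
lat <- c(-15.6, 95, -91, 45, NA)
lon <- c(-47.9, 10, 20, 181, 0)

# TRUE = latitude within [-90, 90] and longitude within [-180, 180].
# Missing values are treated as FALSE here (assumption of this sketch).
in_range <- !is.na(lat) & !is.na(lon) & abs(lat) <= 90 & abs(lon) <= 180
```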
VALIDATION. This test flags records from doubtful sources, for example, records derived from drawings, photographs, multimedia objects, or fossils.
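One way to picture this test is a comparison of each record's source type against a list of accepted types. The sketch below is illustrative only: the accepted list, the example values, the flag column name `.source_trusted`, and the decision to flag records with a missing source are all assumptions, not the package's defaults.

```r
# Made-up records with different source types.
records <- data.frame(
  basisOfRecord = c("PreservedSpecimen", "HumanObservation",
                    "FossilSpecimen", "Drawing", NA),
  stringsAsFactors = FALSE
)

# Hypothetical list of trusted source types (lower case for matching).
accepted <- c("preservedspecimen", "humanobservation")

# TRUE = trusted source; FALSE = doubtful or missing source (assumption).
records$.source_trusted <- tolower(records$basisOfRecord) %in% accepted
```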
ENRICHMENT. Deriving country names for records missing country names.
ENRICHMENT. Country names are standardized against a list of country names in several languages retrieved from Wikipedia.
check_pf <- bdc_country_standardized(
  data = check_pf,
  country = "country"
)
#> Loading auxiliary data: country names from wikipedia
#> Loading auxiliary data: world map and country iso
#> Standardizing country names
#> country found: Argentina
#> country found: Belize
#> country found: Bolivia
#> country found: Brazil
#> country found: Colombia
#> country found: Ecuador
#> country found: France
#> country found: French Guiana
#> country found: Guyana
#> country found: Honduras
#> country found: Japan
#> country found: Mexico
#> country found: Nicaragua
#> country found: Paraguay
#> country found: Suriname
#> country found: Uruguay
#> country found: Venezuela
#>
#> bdc_country_standardized:
#> The country names of 8540 records were standardized.
#> Two columns were added to the database.
AMENDMENT. A mismatch between the informed country and the coordinates can result from negative or transposed coordinates. Once a mismatch is detected, different coordinate transformations are tried to resolve it. Verbatim coordinates are then replaced by the rectified ones in the returned database (a database containing both verbatim and corrected coordinates is also created in the “Output” folder).
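The idea of trying alternative coordinate transformations can be sketched in base R. This is an illustration only, not the package's implementation: the helpers `candidate_coords` and `inside_box` are hypothetical, and a rough bounding box stands in for the country polygons that a real check would use.

```r
# Hypothetical helper: all sign flips and swaps of one coordinate pair.
candidate_coords <- function(lat, lon) {
  data.frame(
    transformation = c("flip_lat", "flip_lon", "flip_both", "swap",
                       "swap_flip_lat", "swap_flip_lon", "swap_flip_both"),
    lat = c(-lat,  lat, -lat, lon, -lon,  lon, -lon),
    lon = c( lon, -lon, -lon, lat,  lat, -lat, -lat),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: is a point inside a rectangular box?
inside_box <- function(lat, lon, box) {
  lat >= box$lat_min & lat <= box$lat_max &
    lon >= box$lon_min & lon <= box$lon_max
}

# Rough, illustrative bounding box for Brazil (not an official boundary).
box <- list(lat_min = -34, lat_max = 5.5, lon_min = -74, lon_max = -34)

# A record whose latitude and longitude were entered the wrong way round.
cand <- candidate_coords(lat = -47.9, lon = -15.6)

# Keep the first transformation that places the record inside the box.
hits <- cand[inside_box(cand$lat, cand$lon, box), ]
fix <- hits[1, ]
```

Here `fix` holds the swapped pair (latitude -15.6, longitude -47.9), the only candidate that falls inside the box.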
check_pf <-
  bdc_coordinates_transposed(
    data = check_pf,
    id = "database_id",
    sci_names = "scientificName",
    lat = "decimalLatitude",
    lon = "decimalLongitude",
    country = "country",
    countryCode = "countryCode",
    border_buffer = 0.2 # in decimal degrees (~22 km at the equator)
  )
#> Correcting latitude and longitude transposed
#> Testing coordinate validity
#> Removed 1522 records.
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing sea coordinates
#> Flagged 704 records.
#> Testing country identity
#> Flagged 716 records.
#> Flagged 716 of 7018 records, EQ = 0.1.
#> 716 ocurrences will be tested
#> Processing occurrences from: BR (713)
#> Processing occurrences from: CO (1)
#> Processing occurrences from: MX (1)
#> Processing occurrences from: VE (1)
#>
#> bdc_coordinates_transposed:
#> Corrected 19 records.
#> One columns were added to the database.
#> Check database containing coordinates corrected in:
#> Output/Check/01_coordinates_transposed.csv
VALIDATION. This test flags records outside one or multiple reference countries, i.e., records in other countries or beyond an informed distance from the coast (e.g., in the ocean). The distance buffer avoids flagging as invalid those records that lie close to country limits (e.g., records of coastal or marshland species).
check_pf <-
  bdc_coordinates_country_inconsistent(
    data = check_pf,
    country_name = "Brazil",
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    dist = 0.1 # in decimal degrees (~11 km at the equator)
  )
#> dist is assumed to be in decimal degrees (arc_degrees).
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#>
#> bdc_coordinates_country_inconsistent:
#> Flagged 658 records.
#> One column was added to the database.
ENRICHMENT. Coordinates can be derived from a detailed description of the locality associated with records in a process called retrospective geo-referencing.
xyFromLocality <- bdc_coordinates_from_locality(
  data = check_pf,
  locality = "locality",
  lon = "decimalLongitude",
  lat = "decimalLatitude"
)
#>
#> bdc_coordinates_from_locality
#> Found 1524 records missing or with invalid coordinates but with potentially useful information on locality.
#>
#> Check database in: C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Check/01_coordinates_from_locality.csv
Creating a column named “.summary” summarizing the results of all VALIDATION tests. This column is “FALSE” if the record was flagged as “FALSE” by any test (i.e., it is a potentially invalid or suspect record).
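The rule behind the summary column is a row-wise logical AND over all flag columns. A minimal base R sketch, with made-up data and under the assumption that every column starting with “.” is a TRUE/FALSE flag, could look like this (it is not the package's own code):

```r
# Made-up records with three flag columns (TRUE = passed the test).
records <- data.frame(
  id = 1:4,
  .scientificName_empty   = c(TRUE, TRUE, FALSE, TRUE),
  .coordinates_empty      = c(TRUE, FALSE, TRUE, TRUE),
  .coordinates_outOfRange = c(TRUE, TRUE, TRUE, TRUE),
  check.names = FALSE
)

# Flag columns are assumed to be those whose names start with ".".
flag_cols <- grep("^\\.", names(records), value = TRUE)

# A record is valid only if it passed every test.
records$.summary <- Reduce(`&`, records[flag_cols])
```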
check_pf <- bdc_summary_col(data = check_pf)
#>
#> bdc_summary_col:
#> Flagged 2888 records.
#> One column was added to the database.
Creating a report summarizing the results of all tests.
report <-
  bdc_create_report(
    data = check_pf,
    database_id = "database_id",
    workflow_step = "prefilter"
  )
#>
#> bdc_create_report:
#> Check the report summarizing the results of the prefilter in:
#> Output/Report
report
Creating figures (bar plots and maps) to facilitate the interpretation of the results of data quality tests.
bdc_create_figures(
  data = check_pf,
  database_id = "database_id",
  workflow_step = "prefilter"
)
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures
Transposed coordinates
Coordinates and country inconsistent
Summary of all tests
We can now remove records flagged as erroneous or suspect. Records missing names or coordinates, outside a region of interest, or from doubtful sources are rarely suitable for biodiversity analyses. We keep only valid records (flagged as TRUE) using the column “.summary”. Next, we use the bdc_filter_out_flags function to remove all test columns (those starting with “.”).
# The %>% pipe must be available, e.g., after library(dplyr)
output <-
  check_pf %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#>
#> bdc_fiter_out_flags:
#> The following columns were removed from the database:
#> .scientificName_empty, .coordinates_empty, .coordinates_outOfRange, .basisOfRrecords_notStandard, .coordinates_country_inconsistent, .summary
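For readers who want to see what this final step amounts to, the same filtering can be sketched in base R. The data are made up, and the sketch assumes that every column whose name starts with “.” is a test column; it is an illustration, not a replacement for the package function.

```r
# Made-up records with one test column and the summary column.
records <- data.frame(
  database_id = c("a_1", "a_2", "a_3"),
  scientificName = c("Panthera onca", "Puma concolor", ""),
  .scientificName_empty = c(TRUE, TRUE, FALSE),
  .summary = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE,
  check.names = FALSE
)

# Keep rows flagged as valid; %in% also drops rows where .summary is NA.
keep_rows <- records$.summary %in% TRUE

# Drop every column whose name starts with "." (assumed to be test columns).
keep_cols <- !startsWith(names(records), ".")

output <- records[keep_rows, keep_cols, drop = FALSE]
```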